Martín Bel, University of Buenos Aires, m4rbel@gmail.com PRIMARY
Nadia Romano, University of Buenos Aires, nadia.romano@gmail.com
Delia Balaoi, University of Buenos Aires, delia.balaoi@gmail.com
Student Team: YES
Did you use data from both mini-challenges? NO
Tableau
The following open source packages/libraries:
R – data.table: M Dowle, T Short, S Lianoglou, A Srinivasan, with contributions from R Saporta, E Antonyan
R – ggplot2: H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.
igraph: Csardi G, Nepusz T: The igraph software package for complex network research, InterJournal, Complex Systems 1695. 2006. http://igraph.org
R – Shiny: http://shiny.rstudio.com
Plot.ly – D3 based library used to make ggplot2 R objects interactive
Dygraphs.js
We have also used MySQL as a database for part of the analysis.
Approximately how many hours were spent working on this submission in total?
100 hours
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2015 is complete? YES
Video Download
Video: http://www.youtube.com/watch?v=Kw8Z3HLBJ6Q
To download: https://drive.google.com/file/d/0B27xCoMNYGfVbGxpQl9UNDBIS3c/view
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Questions
MC2.1 – Identify those IDs that stand out for their large volumes of communication. For each of these IDs
a. Characterize the communication patterns you see.
b. Based on these patterns, what do you hypothesize about these IDs?
Limit your response to no more than 4 images and 300 words.
The IDs that stand out for their large volume of communications account for 12.08% of the total volume of incoming and outbound messages. These IDs are 1278894 and 839736.
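As a rough illustration of how such a share can be obtained, the sketch below counts the messages each ID takes part in and the fraction of all messages that involve a chosen set of IDs. The record layout `(timestamp, sender, recipient, location)`, the sample rows and the function names are assumptions made for this example; they are not the code or data used for the submission.

```python
from collections import Counter

# Hypothetical sample records: (timestamp, sender, recipient, location).
messages = [
    ("2014-06-08 12:00:01", "839736", "1001", "Entry Corridor"),
    ("2014-06-08 12:00:05", "1011", "839736", "Wet Land"),
    ("2014-06-08 12:00:09", "839736", "1002", "Entry Corridor"),
    ("2014-06-08 12:01:00", "1278894", "1003", "Entry Corridor"),
    ("2014-06-08 12:01:30", "1278894", "1004", "Entry Corridor"),
    ("2014-06-08 12:02:00", "1005", "1006", "Tundra Land"),
    ("2014-06-08 12:02:10", "1007", "1008", "Coaster Alley"),
    ("2014-06-08 12:02:20", "1009", "1010", "Kiddie Land"),
]

def volume_by_id(records):
    """Count the messages each ID takes part in, as sender or recipient."""
    counts = Counter()
    for _, sender, recipient, _ in records:
        counts[sender] += 1
        if recipient != sender:
            counts[recipient] += 1
    return counts

def share_of_total(records, ids):
    """Fraction of all messages that involve at least one of the given IDs."""
    ids = set(ids)
    hits = sum(1 for _, s, r, _ in records if s in ids or r in ids)
    return hits / len(records) if records else 0.0

counts = volume_by_id(messages)
top_two = [id_ for id_, _ in counts.most_common(2)]
share = share_of_total(messages, top_two)
print(top_two, f"{share:.2%}")
```

Counting a message once even when both endpoints are high-volume IDs keeps the share from exceeding 100%.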
We computed the number of outbound communications in five-minute intervals and plotted it by day and ID, as shown in Figure 1.
Figure 1: Outbound messages from high volume IDs. Interactive version
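The five-minute aggregation behind a plot like Figure 1 can be sketched as follows. This is a minimal illustration, assuming records of the form `(timestamp, sender, recipient, location)` with timestamps formatted as `YYYY-MM-DD HH:MM:SS`; the helper names and sample rows are hypothetical.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def floor_to_bin(ts, width_min=5):
    """Round a timestamp string down to the start of its width_min-minute bin."""
    t = datetime.strptime(ts, FMT)
    return t.replace(minute=t.minute - t.minute % width_min, second=0)

def outbound_counts(records, ids, width_min=5):
    """Count outbound messages per (sender, bin start) for the given sender IDs."""
    ids = set(ids)
    counts = Counter()
    for ts, sender, _, _ in records:
        if sender in ids:
            counts[(sender, floor_to_bin(ts, width_min))] += 1
    return counts

# Hypothetical sample rows.
sample = [
    ("2014-06-08 12:00:01", "839736", "1001", "Entry Corridor"),
    ("2014-06-08 12:04:59", "839736", "1002", "Entry Corridor"),
    ("2014-06-08 12:05:00", "839736", "1003", "Entry Corridor"),
    ("2014-06-08 12:05:30", "1001", "839736", "Wet Land"),
]
binned = outbound_counts(sample, ["839736", "1278894"])
for (sender, start), n in sorted(binned.items()):
    print(sender, start.strftime("%a %H:%M"), n)
```

Because each key keeps the full date, the same counts can be split by day when plotting.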
As Figure 1 shows, there is a sharp increase in the number of outbound messages from ID 839736 around 12:00 p.m. on Sunday, followed by a second, smaller peak at 2 p.m. This ID sends messages from Entry Corridor to users, and it does so every minute of the day. Our hypothesis is that this is a bot from the park's app that tries to find the location of every user.
The second ID, 1278894 (see Figure 2), behaves regularly and sends messages from Entry Corridor. It sends messages at 5-minute intervals during specific hours: 12, 14, 16, 18 and 20 h. The only peculiarity we could find is that between 14:40 and 15:00 the number of messages sent decreases on Friday and Saturday, whereas on Sunday there is no such drop. However, this behaviour doesn't seem to be relevant to understanding the crime.
Figure 2: Outbound messages from 1278894
So far we’ve analysed outbound messages from the highest-volume IDs. In this section, we’ll focus on incoming messages. We computed the number of messages received by ID 839736 on Sunday, grouped by location in 5-minute intervals. See Figure 3.
Figure 3: Incoming messages received by the ID 839736. Interactive version
As expected, there is a high peak of incoming messages at 12 p.m. coming from Wet Land. We also found a second peculiarity: at 3 p.m. there is another peak, coming from users in Coaster Alley. Comparing the volume of incoming messages to this ID across locations, Wet Land is the largest source of communications.
From the previous analysis, we can assert that a relevant event took place around 12 p.m. in Wet Land. Since Wet Land is the area where the show took place, we believe this is the moment the park authorities detected the crime.
MC2.2 – Describe up to 10 communications patterns in the data. Characterize who is communicating, with whom, when and where. If you have more than 10 patterns to report, please prioritize those patterns that are most likely to relate to the crime.
Limit your response to no more than 10 images and 1000 words.
Based on our previous analysis we hypothesize that the crime was detected around 12 p.m. on Sunday. So now we’ll focus on the communication patterns that were found at this time.
When looking at the total number of communications per day, hour and location, we find a sharp increase in communications in Wet Land and Entry Corridor compared with previous days. There is also a clear drop in the messages sent from Wet Land after 12 p.m. on Sunday compared with Friday and Saturday, suggesting that the area near the show was probably closed shortly after 12 p.m.
Figure 4: Total messages sent and received for each location. Interactive version
We can also see a peak in the number of outbound messages from Entry Corridor compared with the previous days, and a decrease in Wet Land after the peak.
Now, we will take a closer look at who is communicating with whom around this time. When analyzing the IDs that communicated with external individuals, we found a few outliers. Around the time of the crime there is a peak in the amount of messages sent from Wet Land.
Figure 5: Total messages sent to External for each location. Interactive version
From the above images we hypothesize that the incident was discovered around 12 p.m. So let’s zoom in on the IDs that were most active at 11 a.m. These IDs show few communications before 11 a.m.; however, at around 11:30 a.m. the number of messages increases significantly. There is very strong activity from 11:30 to 11:47. After this time, two interesting facts can be spotted. First, activity drops abruptly. Second, all of the messages these IDs sent after 11:47 went to external numbers. Based on this pattern, we believe that these IDs were involved in the incident. See Figure 6.
Figure 6: Active IDs at 11 am Interactive Version
Now we’ll focus on what happened after the incident. At 12 p.m. there is another peak of communications from Wet Land. Our hypothesis is that the incident was discovered around this time. We filtered the IDs with the highest level of activity at 12 p.m. and found 7 that stand out from the rest. These IDs started sending messages to ID 839736 every minute until a few minutes after 12:30 p.m. See Figure 7.
Of these, two IDs, 38945 and 95112, also sent messages to ID 1278894 every few minutes. IDs 731443 and 947320 also communicated with 36486. The only ones that didn’t send messages to any ID other than 839736 and 1278894 were 1092525 and 1601276. All of these IDs dropped communications to almost zero after 12:30 p.m. This group is probably part of the park’s security team.
Figure 7: Active IDs at 12 p.m. Interactive version
We’ve identified the potential sources of the disturbance (the most active users at 11 a.m. in Wet Land, Figure 6) and the potential members of the park’s security team (the most active users at 12 p.m. in Wet Land, Figure 7). Our next step is to look for any evidence of how the vandalism was planned.
In order to do so, we identified the IDs that visited the park on both Friday and Sunday and communicated only with External sources (Figure 8). IDs 554218, 107490 and 1685871 communicated from areas surrounding Wet Land and from the road leading from the entrance towards Wet Land and Coaster Alley. Our suspicion is that at least one of these 3 individuals came to the park on Friday to plan Sunday’s attack. ID 1965716 doesn’t seem suspicious because on Sunday this ID sent messages only from Kiddie Land.
Figure 8: IDs that sent messages only to External on both Friday and Sunday. Interactive version
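The selection behind Figure 8 can be sketched with simple set logic. The example below assumes that presence in the park on a given day is inferred from having sent at least one message that day, that the recipient of an external message is recorded as the literal string `"external"`, and that records look like `(timestamp, sender, recipient, location)`. All of these, along with the function name and the sample rows, are assumptions for illustration.

```python
from collections import defaultdict

EXTERNAL = "external"  # assumed marker for recipients outside the park

def external_only_on_days(records, days):
    """Senders active on every day in `days` whose messages all went to EXTERNAL."""
    days_seen = defaultdict(set)
    talked_internally = set()
    for ts, sender, recipient, _ in records:
        days_seen[sender].add(ts[:10])  # "YYYY-MM-DD" prefix of the timestamp
        if recipient != EXTERNAL:
            talked_internally.add(sender)
    required = set(days)
    return sorted(
        sender
        for sender, seen in days_seen.items()
        if required <= seen and sender not in talked_internally
    )

# Hypothetical sample rows; 2014-06-06 is the Friday and 2014-06-08 the Sunday.
sample = [
    ("2014-06-06 10:15:00", "554218", "external", "Wet Land"),
    ("2014-06-08 11:40:00", "554218", "external", "Wet Land"),
    ("2014-06-06 09:00:00", "2001", "external", "Kiddie Land"),
    ("2014-06-08 09:30:00", "2001", "2002", "Kiddie Land"),
    ("2014-06-08 10:00:00", "2003", "external", "Tundra Land"),
]
suspects = external_only_on_days(sample, ["2014-06-06", "2014-06-08"])
print(suspects)
```

The same function covers the Saturday and Sunday comparison by passing a different pair of dates.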
On Saturday and Sunday we found ten IDs that sent messages only to External sources. The possible mastermind(s) of the crime could have been at Wet Land and Coaster Alley one day before the crime, at similar hours, to plan the attack (IDs 708696 and 2030671). See Figure 9.
Figure 9: IDs that sent messages only to External on both Saturday and Sunday. Interactive version
Another pattern emerged when inspecting the number of people who sent messages to groups as opposed to individuals. In Figure 10, the number of messages sent to groups between 11:45 a.m. and 12:30 p.m. on Sunday is much higher than on the other days. It seems reasonable to think that during this period an extraordinary event occurred, triggering group messages as alerts from individuals to their family, friends or coworkers. It’s interesting to note how the value falls and rises back 15 minutes later, and then again at 14:00.
Figure 10. Total amount of IDs that send messages to groups Interactive version
Figure 11 shows the average time between messages per hour at the different locations of the park. All of the averages behave similarly except for Wet Land’s. Between 10 a.m. and 1 p.m., its average is significantly lower than the average for the rest of the park. Since the graph shows time between messages, this lower average implies a significantly higher frequency of messages during those hours in the Wet Land area.
Figure 11. Average amount of time between messages Interactive version
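One way to obtain a metric like the one in Figure 11 is sketched below. It assumes that the "time between messages" is the gap between consecutive messages sent from the same location on the same day, attributed to the hour of the later message. That definition, the record layout `(timestamp, sender, recipient, location)` and the names used are assumptions; the submission does not spell out the exact computation.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def mean_gap_by_location_hour(records):
    """Mean seconds between consecutive messages, keyed by (location, date, hour)."""
    times_by_location = defaultdict(list)
    for ts, _, _, location in records:
        times_by_location[location].append(datetime.strptime(ts, FMT))

    gaps = defaultdict(list)
    for location, times in times_by_location.items():
        times.sort()
        for prev, cur in zip(times, times[1:]):
            if prev.date() != cur.date():
                continue  # do not count the overnight gap between park days
            key = (location, cur.date().isoformat(), cur.hour)
            gaps[key].append((cur - prev).total_seconds())
    return {key: sum(vals) / len(vals) for key, vals in gaps.items()}

# Hypothetical sample rows.
sample = [
    ("2014-06-08 11:00:00", "1", "2", "Wet Land"),
    ("2014-06-08 11:00:10", "3", "4", "Wet Land"),
    ("2014-06-08 11:00:30", "5", "6", "Wet Land"),
    ("2014-06-08 11:00:00", "7", "8", "Tundra Land"),
    ("2014-06-08 11:10:00", "9", "7", "Tundra Land"),
    ("2014-06-07 23:00:00", "1", "2", "Wet Land"),
]
mean_gaps = mean_gap_by_location_hour(sample)
for key in sorted(mean_gaps):
    print(key, mean_gaps[key])
```

A lower mean gap for a location and hour corresponds to a higher message frequency there, which is how the Wet Land values are read in the text.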
Analysing the number of messages sent by visitors who were present on all 3 days (Figure 12), we noticed that on Friday people communicated more from Coaster Alley, Entry Corridor and Kiddie Land than on the other days. On Sunday, on the other hand, communications dropped in Wet Land after the 11:45 a.m. peak, probably because of the Pavilion closure. In Tundra Land, which is right next to Wet Land, they increased. Our hypothesis is that after the incident some parts of Wet Land were closed and people moved to Tundra Land.
Figure 12. Number of messages sent by IDs present on all 3 days. Interactive version
If we focus on the 10:00 a.m. - 3:00 p.m. interval on Sunday (Figure 13), we can see that around 11:45 a.m. the number of messages increased in Wet Land. Immediately after, around 12:00 p.m., it also started to increase in Entry Corridor. In both areas the number of messages dropped close to 12:30 p.m. We assume that the vandalism at Creighton Pavilion was noticed by people nearby, who started to alert their groups and also security. Around 11:00 a.m. there was a peak in Coaster Alley as well, but we assume it is not related to the crime, as most of those messages were sent to people in the same area.
Figure 13. Number of messages sent by IDs present on all 3 days, 10:00 - 15:00 range. Interactive version
MC2.3 – From this data, can you hypothesize when the crime was discovered? Describe your rationale.
Limit your response to no more than 3 images and 300 words.
From the analysis of the high-volume IDs in the first section, we believe the park authorities informed people in the park around 12:00 p.m. that there was a problem in Wet Land.
We also found other patterns that show high amounts of communication between 11:30 a.m. and 11:45 a.m. For example, in Figure 14 there is a strong peak in the number of messages sent from Wet Land, while the number of unique IDs sending them grows at a constant rate. This reveals that some people started communicating more heavily.
Figure 14. Amount of unique ids versus total amount of messages sent. Interactive version
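The comparison in Figure 14 amounts to tracking two series per time bin for one location: how many distinct IDs sent messages and how many messages were sent. A minimal sketch follows, assuming records of the form `(timestamp, sender, recipient, location)` with `YYYY-MM-DD HH:MM:SS` timestamps; the five-minute bin width and all names are assumptions for this example.

```python
from collections import defaultdict

def senders_vs_messages(records, location, width_min=5):
    """Per time bin: (unique senders, total messages, messages per sender)."""
    totals = defaultdict(int)
    senders = defaultdict(set)
    for ts, sender, _, loc in records:
        if loc != location:
            continue
        minute = int(ts[14:16])
        # Bin label such as "2014-06-08 11:30": date and hour kept, minute floored.
        bin_label = ts[:14] + f"{minute - minute % width_min:02d}"
        totals[bin_label] += 1
        senders[bin_label].add(sender)
    return {
        b: (len(senders[b]), totals[b], totals[b] / len(senders[b]))
        for b in sorted(totals)
    }

# Hypothetical sample rows: the second bin has more messages from the same two IDs.
sample = [
    ("2014-06-08 11:26:00", "1", "9", "Wet Land"),
    ("2014-06-08 11:28:00", "2", "9", "Wet Land"),
    ("2014-06-08 11:30:10", "1", "9", "Wet Land"),
    ("2014-06-08 11:31:20", "1", "9", "Wet Land"),
    ("2014-06-08 11:32:30", "2", "9", "Wet Land"),
    ("2014-06-08 11:33:40", "2", "9", "Wet Land"),
    ("2014-06-08 11:31:00", "3", "9", "Tundra Land"),
]
series = senders_vs_messages(sample, "Wet Land")
for bin_label, (n_ids, n_msgs, ratio) in series.items():
    print(bin_label, n_ids, n_msgs, ratio)
```

A rising messages-per-sender ratio with a steady sender count is the signature described in the text: the same people communicating more heavily.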
In order to understand when the event was discovered, we looked into centrality metrics and traced their evolution over time. Figure 15 shows the following metrics: global clustering coefficient, mean betweenness and maximum degree. These metrics were computed with a one-minute window for each timestamp, revealing changes in the network structure.
The clustering coefficient plot shows a pattern similar to the one found in Wet Land. Such a drastic increase implies that people became heavily connected to their neighbours around the time of the incident, and that the connectivity of the network reached a maximum at around 12:00 p.m. (see the peaks in betweenness and maximum network degree).
Figure 15. Centrality metrics computed with a one-minute window for each timestamp on Sunday between 11:00 a.m. and 1:00 p.m. Interactive version
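To make the three metrics concrete, the sketch below computes them in plain Python for the messages that fall within a single minute. It is an illustration only: it treats the communication network as an undirected simple graph, uses unnormalised betweenness (computed with Brandes' algorithm), and assumes records of the form `(timestamp, sender, recipient, location)`. The submission's own figures were produced with igraph, whose settings are not stated here.

```python
from collections import deque

def build_graph(edges):
    """Undirected simple graph as a dict of adjacency sets (self-loops dropped)."""
    adj = {}
    for a, b in edges:
        if a != b:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def transitivity(adj):
    """Global clustering coefficient: closed triples / connected triples."""
    closed = triples = 0
    for nbrs in adj.values():
        nb = list(nbrs)
        triples += len(nb) * (len(nb) - 1) // 2
        for i in range(len(nb)):
            for j in range(i + 1, len(nb)):
                if nb[j] in adj[nb[i]]:
                    closed += 1
    return closed / triples if triples else 0.0

def betweenness(adj):
    """Unnormalised betweenness per node (Brandes' algorithm, unweighted)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):  # accumulate dependencies, farthest first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice

def window_metrics(records, minute):
    """Metrics for messages whose timestamp starts with `minute` (YYYY-MM-DD HH:MM)."""
    adj = build_graph((s, r) for ts, s, r, _ in records if ts[:16] == minute)
    if not adj:
        return None
    bc = betweenness(adj)
    return {
        "clustering": transitivity(adj),
        "mean_betweenness": sum(bc.values()) / len(bc),
        "max_degree": max(len(n) for n in adj.values()),
    }

# Hypothetical sample: a triangle a-b-c with a pendant node d attached to c.
sample = [
    ("2014-06-08 12:00:05", "a", "b", "Wet Land"),
    ("2014-06-08 12:00:15", "b", "c", "Wet Land"),
    ("2014-06-08 12:00:25", "c", "a", "Wet Land"),
    ("2014-06-08 12:00:35", "c", "d", "Wet Land"),
    ("2014-06-08 12:01:05", "d", "e", "Wet Land"),
]
metrics = window_metrics(sample, "2014-06-08 12:00")
print(metrics)
```

Calling `window_metrics` for each minute between 11:00 and 13:00 yields one time series per metric, which is the shape of data a plot like Figure 15 needs.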
The two remaining plots reveal patterns similar to those in Figure 1.
We know that a singular event took place around 12 p.m. We also know that it involved large amounts of communication between Entry Corridor and Wet Land, and that the high-volume IDs account for most of the traffic.
With these considerations in mind, we believe that the incident took place between 11:30 a.m. and 11:45 a.m.